The U.S. Department of Education College Scorecard database is a rich source of information intended to help students and parents understand the true costs of attending college in America. This script leverages the database to compute an earnings premium and a value proposition for each college.
IMPORTANT: This is a quick-and-dirty stab at an analysis that serves as an exploratory data analysis. It’s not a commentary on the actual value represented by a college. Future earnings aren’t the sum total of the value of a college education, and this analysis is too crude to reliably capture even the earnings component.
The value proposition of a college is defined as the ratio of adjusted annual earnings to the expected annual cost of attendance.
The adjusted annual earnings is computed by (1) adding the earnings premium to the overall mean (across all colleges) of annual earnings 6 years after matriculation, and (2) multiplying this sum by the completion rate for the college to adjust for the probability of actually achieving a degree from the college.
The annual costs are computed across a range of financial need levels: without financial aid and with financial aid corresponding to the household income level (i.e., the net price for the household income level).
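In code, these definitions reduce to two lines. Here is a minimal sketch in R; every input value below is a purely illustrative stand-in, not a figure from the database:

```r
# Purely illustrative inputs (not actual database values)
mean_earnings_6yr <- 33500  # grand mean earnings 6 years after matriculation
earnings_premium  <- 4000   # college-specific premium from the model
completion_rate   <- 0.80   # probability of completing a degree at the college
net_price         <- 21000  # expected annual cost at the student's need level

# Adjusted earnings: premium-shifted earnings scaled by completion probability
adjusted_earnings <- (mean_earnings_6yr + earnings_premium) * completion_rate

# Value proposition: adjusted annual earnings per dollar of annual cost
value_proposition <- adjusted_earnings / net_price
```

With these toy numbers, the adjusted earnings come to $30K/yr, for a value proposition of about 1.43.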
Load the kaggle.com version of the U.S. Dept. of Education College Scorecard Dataset and generate features for modeling, using code similar to that in the previously submitted “Best Colleges for You” script. The script uses the RSQLite package to load the database, and we pare the data down to approximately 900 four-year, Bachelor’s-degree-granting colleges.
Approximate Bayes factors serve as the features of the model. The Bayes factor is the ratio of the posterior odds of the hypothesis (i.e., after receiving the evidence) to the prior odds of the hypothesis (i.e., before seeing the evidence). The evidence is an attribute of a student – like “SAT score greater than 1400” – and the hypothesis is attendance at the college. (There may not be a compelling reason to do this, but informal testing in other applications indicates a greater ability to distinguish amongst the colleges in the Bayes-factor feature space than in the raw variable space.)
We compute the log-base-10 Bayes factor, \(\operatorname{BF}_{\log10}\), i.e., \(\log_{10}\) applied to the ratio of posterior odds to prior odds, using the following for the feature corresponding to attribute \(\text{Attribute }Y\) for each college of interest \(\text{College }X\). Here \(E\) represents the evidence – in this case, observing a college student with \(\text{Attribute }Y\) – and \(H\) represents the hypothesis – in this case, the proposition that the student is enrolled at \(\text{College }X\):
\[ \begin{aligned} &\operatorname{BF}_{\log10}( H = \text{College }X | E = \text{Attribute }Y ) \\ &= \log_{10}\frac{ \operatorname{Odds}( H = \text{College }X | E = \text{Attribute }Y )}{\operatorname{Odds}( H = \text{College }X ) }\\ &= \log_{10} \frac{\operatorname{P}( E = \text{Attribute }Y | H = \text{College }X )}{\operatorname{P}( E = \text{Attribute }Y | H = \lnot (\text{College }X) )}\\ &\approx \log_{10} \frac{ \operatorname{P}( E = \text{Attribute }Y | H = \text{College }X )}{\operatorname{P}( E = \text{Attribute }Y )}\\ \end{aligned} \]
The final approximation is a good one because there are 900 colleges in our working set; so the students attending \(\text{College }X\) are a very small proportion of the entire American student population. Therefore, the probability of finding a student with \(\text{Attribute }Y\) amongst students not at \(\text{College }X\) is basically the same as finding such a student amongst the entire student population (\(\text{College }X\) included).
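A quick numeric check illustrates why the approximation holds. All the values below are hypothetical: College X enrolls roughly 0.1% of the students in the working set, and by the law of total probability the exact \(\operatorname{P}( E | \lnot \text{College }X )\) barely differs from the overall \(\operatorname{P}( E )\):

```r
# Hypothetical values: College X holds ~0.1% of all students (1 of ~900 colleges)
p_college      <- 0.001  # P(H = College X)
p_attr_at_X    <- 0.60   # P(E | H = College X)
p_attr_overall <- 0.05   # P(E), across all students

# Law of total probability: P(E) = P(E|X) P(X) + P(E|not X) (1 - P(X)),
# solved for P(E | not X)
p_attr_not_X <- (p_attr_overall - p_attr_at_X * p_college) / (1 - p_college)

# Relative error of approximating P(E | not X) by P(E): on the order of 1%
rel_err <- abs(p_attr_not_X - p_attr_overall) / p_attr_overall
```

Even with the conditional probability at College X an order of magnitude higher than the population average, the approximation is off by only about 1%.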
Above, the flip in the propositions on each side of the conditional (“|”) when we go from \(\operatorname{Odds}(H|E)\) to probabilities, \(\operatorname{P}(E|H)\), occurs by applying Bayes’ Rule.
For some attributes, the probabilities \(\operatorname{P}( E = \text{Attribute }Y | H = \text{College }X )\) used in approximating the Bayes factors are simply the proportions reported in the database.
But the main challenge in using Bayes factors as features is computing the final probabilities for attributes for which only moments and quantiles are reported in the database. For such attributes, we first approximate a full probability distribution of students at the college of interest and then evaluate the probability of the proposition under that distribution.
Take SAT, for example. We are given the mean, median, and quartiles of the SAT scores of students at each college. For each college, we fit a model to these values to approximate the continuous distribution of SAT scores of its students. Then, for an attribute such as \(\text{Attribute }Y\) = “SAT score greater than 1400”, we compute \(\operatorname{P}( E = \text{Attribute }Y | H = \text{College }X )\) as the upper-tail probability of SAT with lower limit 1400. Finally, to compute \(\operatorname{P}( E = \text{Attribute }Y )\), we take the weighted average, across all colleges, of the previously calculated conditional probability, where the weight is the proportion of students attending \(\text{College }X\) amongst all students in our 900-college universe. In the database, field UGDS is the number of undergraduate students attending the college, so the proportion of all students attending \(\text{College }X\) is \(\operatorname{P}( H = \text{College }X ) =\) UGDS[College_X]/sum( UGDS ), where the sum runs over all colleges.
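As a concrete sketch of this step, consider three hypothetical colleges. The normal distribution, the score parameters, and the enrollment counts below are all illustrative assumptions, not values from the database (the actual script fits a distribution to each college’s reported quantiles):

```r
# Hypothetical fitted SAT distributions for 3 colleges (normal, for illustration)
sat_mean <- c(1480, 1150,  990)      # fitted means
sat_sd   <- c(  90,  120,  130)      # fitted standard deviations
UGDS     <- c(6000, 25000, 14000)    # undergraduate enrollments

# P(E | H = College X): upper-tail probability of SAT > 1400 at each college
p_attr_given_college <- pnorm(1400, mean = sat_mean, sd = sat_sd, lower.tail = FALSE)

# P(H = College X): each college's share of all students in the working set
p_college <- UGDS / sum(UGDS)

# P(E): enrollment-weighted average of the conditional probabilities
p_attr <- sum(p_attr_given_college * p_college)

# Approximate log10 Bayes factor for each college
BF_log10 <- log10(p_attr_given_college / p_attr)
```

Here the first, high-scoring college gets a positive log10 Bayes factor for this attribute, while the other two get negative values.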
For attributes like region and locale, to estimate \(P( E = \text{Attribute }Y | H = \text{College }X )\), we define \(\text{Attribute }Y\) to reflect preference. So, for, say, Region == 'Great Lakes', \(\text{Attribute }Y\) is “prefers attending college in Great Lakes region”. We then use our own subjective knowledge to assume that, of the students attending a college in a specific region (or locale), a non-zero proportion prefer attending college in an alternative region (or locale), and that alternative regions (locales) more similar to the one in which the college exists have a higher proportion of such students than less similar alternatives. For regions, similarity is determined by distance – e.g., Mid-East is more similar to Great Lakes than is Far West. For locales, similarity is determined by character – i.e., campuses in suburbs of large cities are more similar to those in large cities than are remote rural campuses.
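One minimal way to encode that subjective assumption is a row-normalized similarity matrix; the regions and weights below are purely illustrative, not the values actually used in the script:

```r
# Illustrative similarity weights: rows = region of the attended college,
# columns = region the student is assumed to prefer. Values are subjective.
regions <- c("Great Lakes", "Mid East", "Far West")
sim <- matrix(
  c(1.0, 0.5, 0.1,   # Great Lakes colleges: Mid East is a closer substitute than Far West
    0.5, 1.0, 0.1,
    0.1, 0.1, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(attends = regions, prefers = regions)
)

# P(prefers region Y | attends college in region R): normalize each row to sum to 1
# (rowSums(sim) has length nrow, so recycling divides each element by its row total)
p_pref <- sim / rowSums(sim)
```

Each row is then a proper conditional probability distribution over preferred regions, concentrated on the college’s own region and tapering off with dissimilarity.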
Value proposition is defined as the adjusted earnings divided by the expected costs for a household of the student’s specific income level, and thus level of financial need. (Of course, other factors dictate financial need. Here, the annual household income of the students in a cohort is the best indicator we have of those students’ financial need.)
Adjusted earnings is defined as the grand-mean expected earnings plus the earnings premium for the college, multiplied by the completion rate of the college. That is, we treat the completion rate as the probability of a student graduating from the college. Expected earnings would then be this probability multiplied by the earnings of a graduate, plus one minus this probability multiplied by the earnings of non-graduates, which for this analysis we assume to be zero. So these values will obviously tend to under-estimate the actual expected earnings (even after our controlling for the covariates in the model).
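In symbols, with completion rate \(c\), grand-mean expected earnings \(\bar{e}\), the college’s earnings premium \(p\), and non-graduate earnings assumed to be zero, the expectation collapses to the adjusted-earnings formula used in the code:

\[ \text{Earnings}_{\text{adj}} = c\,(\bar{e} + p) + (1 - c)\cdot 0 = c\,(\bar{e} + p) \]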
Again, this is more an exploration into roughly how the colleges sort out rather than anywhere near a determination of the actual value represented by a college degree from any of the colleges.
makeStudentValue <- function(studentBF,epremium,residence_state, income_bracket='gt48Kle75K',sat_lvl='gt1000le1200',unthresh=0 ){
student_value <- studentBF %>% select(-College,-outcome,-matches('Treasury|pell|Year2003')) %>%
filter( Year == 2013 ) %>%
left_join( student %>% select(Year,unitID, starts_with('NPT')), by = c('Year','unitID' )) %>%
left_join( epremium, by = 'unitID' ) %>%
mutate(
Utility = switch(
sat_lvl,
gt1400 = BF_SAT_gt1400,
gt1200le1400 = BF_SAT_gt1200le1400,
gt1000le1200 = BF_SAT_gt1000le1200,
gt800le1000 = BF_SAT_gt800le1000,
le800 = BF_SAT_le800
)
) %>%
arrange( desc(Utility) ) %>%
mutate(
ntp1 = ifelse(grepl('Public',CollegeType),NPT41_PUB,NPT41_PRIV),
ntp2 = ifelse(grepl('Public',CollegeType),NPT42_PUB,NPT42_PRIV),
ntp3 = ifelse(grepl('Public',CollegeType),NPT43_PUB,NPT43_PRIV),
ntp4 = ifelse(grepl('Public',CollegeType),NPT44_PUB,NPT44_PRIV),
ntp5 = ifelse(grepl('Public',CollegeType),NPT45_PUB,NPT45_PRIV)
) %>%
select( -starts_with('NPT4',ignore.case=FALSE)) %>%
#filter( complete.cases(.) ) %>%
mutate(
INSTATE = state == residence_state,
Living_Expenses = COSTT4_A - TUITIONFEE_IN,
maxcost = Living_Expenses + ifelse(INSTATE,TUITIONFEE_IN,TUITIONFEE_OUT),
ntp = switch(
income_bracket,
le30K = ntp1,
gt30Kle48K = ntp2,
gt48Kle75K = ntp3,
gt75Kle110K = ntp4,
gt110K = ntp5
),
Earnings_Adjusted = (Expected_Earnings + earnings_premium)*C150_4_POOLED_SUPP, # Assumes all remaining unexplained earnings are due to a premium.
inccost = ifelse( INSTATE, ntp, ntp + TUITIONFEE_OUT - TUITIONFEE_IN),
unorm = Utility/max(Utility), #10^(Utility-max(Utility)),
cnorm = maxcost/Living_Expenses,
c2norm = inccost/Living_Expenses,
vnorm = (Earnings_Adjusted/Living_Expenses) / (Earnings_Adjusted[[1]]/Living_Expenses[[1]]), # Earnings normalized by that of the highest utility school.
value_prop1 = vnorm/cnorm, #unorm*vnorm/cnorm,
value_prop2 = vnorm/c2norm, #unorm*vnorm/c2norm,
College = sprintf("%d. %s", order(Utility,decreasing=TRUE), gsub('University','U',College) )
) %>%
filter( unorm>unthresh, value_prop2 < 2*value_prop2[[1]]) %>%
mutate(
maxv2 = max(value_prop2[unorm>unthresh]),
value_prop1 = value_prop1/maxv2, #ifelse(scale_independently,max(value_prop1),max(value_prop2)),
value_prop2 = value_prop2/maxv2
) %>%
#filter(value_prop2 < 4 & vnorm < 2 ) %>%
gather( key = aid, value = value_prop, value_prop1, value_prop2 ) %>%
mutate(
cost = ifelse( aid == 'value_prop1', maxcost, ntp ),
coll_label = sprintf("%s,\n$%.1fK/$%.1fK",gsub("^(.+)[^a-zA-Z]+Main Campus","\\1",College),Earnings_Adjusted/1000,cost/1000)
) %>%
mutate( aid = ifelse( aid == 'value_prop1', "WITHOUT Financial Aid", "WITH Financial Aid" ) %>% factor(levels=c("WITHOUT Financial Aid","WITH Financial Aid")) )
}
plotValue <- function( stdt_val, residence_state, sat_lvl, income_bracket, scale_independently = FALSE, unthresh = 0, cost_thresh = 70000, earnings_thresh = 0 ){
xlims <- c(pmax(unthresh,floor(min(stdt_val$unorm)/0.1)*0.1)-0.05,1.15)
stdt_val %>%
filter( unorm > unthresh, cost < cost_thresh, Earnings_Adjusted > earnings_thresh ) %>%
{
ggplot(., aes( x = unorm, y = value_prop, color = CollegeType ) ) +
geom_point() +
geom_text( aes( label = coll_label), vjust = 1.0, size = 3 ) +
scale_x_continuous( limits = xlims, breaks = seq(xlims[1]+0.05,xlims[2]-0.05,by=0.10) ) +
ggtitle(
label = sprintf('Value Proposition With & Without Financial Aid (for %s resident, SAT = %s, Income = %s)',residence_state, sat_lvl, income_bracket ),
subtitle = paste(
'Value Proposition = Adjusted Earnings/Costs',
sprintf('Earnings > $%2.1fK, Cost < $%2.1fK', earnings_thresh/1000, cost_thresh/1000),
sep = "; "
)
) +
labs(
x = sprintf('Normalized Utility: Suitability for student with SAT level %s',sat_lvl),
y = ifelse(scale_independently,'Value Proposition (scaled independently)','Value Proposition')
) +
facet_wrap( ~ aid, scales = ifelse(scale_independently,'free','fixed') ) +
theme( text = element_text( face = 'bold' ) )
}
}
Here’s the calculation for an Ohio resident from a high-income household (greater than $110,000/yr) and whose SAT score is at the highest level, greater than 1400.
res_state <- 'Ohio'
sat_level <- 'gt1400'
inc_level <- 'gt110K'
student_value <- makeStudentValue(
studentBF,
epremium,
residence_state = res_state,
sat_lvl = sat_level,
income_bracket = inc_level
)
ethresh <- quantile( student_value$Earnings_Adjusted[1:30], 0.04 )
student_value %>%
plotValue(
residence_state = res_state,
sat_lvl = sat_level,
income_bracket = inc_level,
earnings_thresh = ethresh
)
And here’s the calculation for an Ohio resident from a high-income household (greater than $110,000/yr) and whose SAT score is at the lowest level, less than 800.
Notice how the adjusted earnings shown in the data point labels are significantly lower than those of the previous plot: In aggregate, the academic ability of a college’s students is a strong predictor of the future earnings of the students. (Says nothing about any individual’s case.) Also, see how financial aid makes the private colleges competitive in value to the public colleges by reducing the net price in the denominator of the formula.
Here’s the calculation for an Ohio resident from a low-income household (less than or equal to $30,000/yr) and whose SAT score is at the highest level, greater than 1400.
Notice how financial aid makes attending an elite private college a fantastic value for such a student.
Rather than continue plotting individual cases, sweep over SAT levels and financial need levels (i.e., household income levels) to create a lattice of plots.
sat_levels <- c('le800','gt800le1000','gt1000le1200','gt1200le1400','gt1400')
income_levels <- c('le30K','gt30Kle48K','gt48Kle75K','gt75Kle110K','gt110K','no_aid') %>% rev()
residence_state <- 'Ohio'
unthresh <- 0.4
student_value_all <-
income_levels %>% lapply(
function(income_bracket){
sat_levels %>% lapply(
function(sat_lvl){
studentBF %>% select(-College,-outcome,-matches('Treasury|pell|Year2003')) %>%
filter( Year == 2013 ) %>%
left_join( student %>% select(Year,unitID, starts_with('NPT')), by = c('Year','unitID' )) %>%
left_join( epremium, by = 'unitID' ) %>%
mutate(
Utility = switch(
sat_lvl,
gt1400 = BF_SAT_gt1400,
gt1200le1400 = BF_SAT_gt1200le1400,
gt1000le1200 = BF_SAT_gt1000le1200,
gt800le1000 = BF_SAT_gt800le1000,
le800 = BF_SAT_le800
)
) %>%
arrange( desc(Utility) ) %>%
mutate(
ntp1 = ifelse(grepl('Public',CollegeType),NPT41_PUB,NPT41_PRIV),
ntp2 = ifelse(grepl('Public',CollegeType),NPT42_PUB,NPT42_PRIV),
ntp3 = ifelse(grepl('Public',CollegeType),NPT43_PUB,NPT43_PRIV),
ntp4 = ifelse(grepl('Public',CollegeType),NPT44_PUB,NPT44_PRIV),
ntp5 = ifelse(grepl('Public',CollegeType),NPT45_PUB,NPT45_PRIV)
) %>%
select( -starts_with('NPT4',ignore.case=FALSE)) %>%
filter( !is.na(ntp1),!is.na(ntp2),!is.na(ntp3),!is.na(ntp4),!is.na(ntp5) ) %>%
mutate(
INSTATE = state == residence_state,
Living_Expenses = COSTT4_A - TUITIONFEE_IN,
maxcost = Living_Expenses + ifelse(INSTATE,TUITIONFEE_IN,TUITIONFEE_OUT),
ntp = switch(
income_bracket,
le30K = ntp1,
gt30Kle48K = ntp2,
gt48Kle75K = ntp3,
gt75Kle110K = ntp4,
gt110K = ntp5,
no_aid = maxcost
),
Earnings_Adjusted = (Expected_Earnings + earnings_premium)*C150_4_POOLED_SUPP, # Assumes all remaining unexplained earnings are due to a premium.
inccost = ifelse( INSTATE, ntp, ntp + TUITIONFEE_OUT - TUITIONFEE_IN),
unorm = Utility/max(Utility), #10^(Utility-max(Utility)),
cnorm = inccost/Living_Expenses,
vnorm = (Earnings_Adjusted/Living_Expenses) / (Earnings_Adjusted[[1]]/Living_Expenses[[1]]), # Earnings normalized by that of the highest utility school.
value_prop = vnorm/cnorm, #unorm*vnorm/cnorm,
coll_rank = order( Utility, decreasing = TRUE ),
College = sprintf("%d. %s", coll_rank, gsub('University','U',College) )
) %>%
#filter(value_prop2 < 4 & vnorm < 2 ) %>%
mutate(
cost = ntp,
coll_label = sprintf("%s,\n($%.1fK/$%.1fK; %.0f%%)",gsub("^(.+)[^a-zA-Z]+Main Campus","\\1",College),Earnings_Adjusted/1000,cost/1000,C150_4_POOLED_SUPP*100),
SAT = factor( sat_lvl, levels = sat_levels ),
Need = factor( income_bracket, levels = income_levels )
)
}
)
}
) %>%
unlist( recursive = FALSE ) %>%
{ do.call( bind_rows, . ) } %>%
#filter( unorm>unthresh, value_prop2 < 2*value_prop2[[1]]) %>%
# mutate(
# maxv = quantile(value_prop[unorm>unthresh],0.99),
# value_prop = value_prop/maxv #ifelse(scale_independently,max(value_prop1),max(value_prop2)),
# ) %>%
filter( value_prop < 10 ) # To avoid outliers
The following plots sweep across SAT level, as columns, and financial need level, as rows. Again, as above, the plots are for a resident of Ohio, so out-of-state tuition and fees apply for public schools outside of Ohio.
(You can fork the script and edit it to test different states of residence.)
scale_independently <- TRUE
rank_thresh <- 50
cost_thresh <- 100000
earnings_thresh <- 0.0
student_value_all %>%
filter( coll_rank <= rank_thresh ) %>%
group_by( SAT ) %>%
filter( coll_rank<=10 | (coll_rank>10 & Earnings_Adjusted>earnings_thresh & cost < cost_thresh) ) %>%
ungroup() %>%
{
ggplot(., aes( x = unorm, y = value_prop, color = CollegeType ) ) +
geom_point( na.rm = TRUE ) +
scale_x_continuous(limits = c(unthresh-0.1,1.1), breaks = seq(unthresh-0.1,1.1,by=0.1)) +
geom_text( aes( label = coll_label), vjust = 1.0, size = 3, na.rm = TRUE ) +
ggtitle(
label = sprintf('Value Proposition (for %s resident)',residence_state),
subtitle = paste(
'Value Proposition = Adjusted Earnings/Costs',
sprintf('Earnings > $%2.1fK, Cost < $%2.1fK', earnings_thresh/1000, cost_thresh/1000),
sep = "; "
)
) +
labs(
x = 'Normalized Utility (Suitability for student at SAT level)',
y = ifelse(scale_independently,'Value Proposition (scaled independently)','Value Proposition')
) +
theme( text = element_text( face = 'bold' ) ) +
facet_grid(
Need ~ SAT,
scales = ifelse(scale_independently,'free','fixed'),
labeller = label_both
)
}
Now the same plot but with the y-axes fixed.
Finally, plot the lattice one more time, allowing the y-axis of every plot to move independently.
### Interpretation
The labeling of the rows by “Need” includes the income bracket to which the student’s household belongs. In other words, the panels in that row correspond to the value proposition for a student whose need level is typical for a household of that income bracket. The top row corresponds to a need level of “no_aid”, meaning the household is wealthy enough to not qualify for any financial aid and would pay the full cost of tuition, fees and living expenses.
Note that within a row, the y-coordinate of a college – the value proposition estimated for that college at the financial need level specified for the row – does not change. That y-value stays the same as SAT level changes because the inputs to the value proposition equation – the adjusted earnings, the costs at that row’s need level, and the completion rate – are properties of the college, not the student, and therefore all stay the same as the student’s SAT level changes. But a different mix of colleges appears in each panel within a row because only the top-ranked colleges – in terms of suitability as determined by the Bayes factor at that panel’s SAT level – are included in each panel, and that set changes with SAT level.
Also note that within a column, the x-coordinate of a college – the suitability of the school for a student scoring at the SAT level specified for that column – does not change. Of course, the y-values do change, because the estimated cost, which is the denominator in the value proposition equation, decreases with increasing financial need.
You can see that the value proposition stays roughly the same as we increase SAT level, moving across the columns from left to right. Also, as student SAT level increases, the mix of suitable colleges transitions from many public colleges to primarily the elite private colleges, like the Ivies, MIT, CalTech, Stanford, etc. The fact that the mix of colleges changes with SAT level indicates that for each level of student ability, there is a set of colleges that fits right into the economic niche to serve them. However, there is a bit of a dip in value proposition in the middle column, corresponding to students from households with incomes near the median American income. This issue might be a cause for concern in our nation….
Also, the adjusted earnings (shown in the data point labels) for the colleges populating the panels in the leftmost columns are significantly lower than those for the colleges in the rightmost columns. This indicates that the expected earnings of low-SAT students are dramatically lower than those of high-SAT students. (Because we adjust the earnings to account for the SAT distribution at each college, this drop in adjusted earnings for low-SAT students is driven mainly by dramatically lower completion rates at the colleges most suitable for them than at the colleges for high-SAT students.)
The value proposition increases dramatically as we increase the financial need level, moving down the rows from top to bottom. As more and more financial aid kicks in, the costs of the colleges drop significantly (while adjusted earnings and completion rates for each school stay constant). In fact, for a high-achieving student (SAT level greater than 1400) from a household with great financial need (income level less than $30,000/yr), the Ivy League schools pose an unparalleled value proposition at the highest suitability levels for that student.
The underlying dataset and the methods applied here really are not suitable for drawing any strong conclusions about the value proposition of a specific college for a specific student. This analysis merely demonstrates some of the considerations one might need to address in investigating the value proposition of colleges. It would be interesting to see how the quick-and-dirty results shown here compare to a more rigorous analysis and to published sources of value comparisons and value rankings of colleges.
I hope you found this interesting and that it prompts you to dig deeper and do more….
=======================================================================================
Copyright 2017 Michael L. Thompson
=======================================================================================
“collegeValue.Rmd” reapplies portions of “BestCollegeforYou_KaggleSubmission.Rmd” originally submitted to the kaggle.com competition “US Dept of Education: College Scorecard” under the Apache License, v. 2.0. at https://www.kaggle.com/apollostar/d/kaggle/college-scorecard/which-college-is-best-for-you
=======================================================================================
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
=======================================================================================